32 research outputs found

    KERT: Automatic Extraction and Ranking of Topical Keyphrases from Content-Representative Document Titles

    We introduce KERT (Keyphrase Extraction and Ranking by Topic), a framework for topical keyphrase generation and ranking. By shifting from the unigram-centric traditional methods of unsupervised keyphrase extraction to a phrase-centric approach, we are able to directly compare and rank phrases of different lengths. We construct a topical keyphrase ranking function which implements the four criteria that represent high quality topical keyphrases (coverage, purity, phraseness, and completeness). The effectiveness of our approach is demonstrated on two collections of content-representative titles in the domains of Computer Science and Physics. Comment: 9 pages

    Evaluating Robustness of Dialogue Summarization Models in the Presence of Naturally Occurring Variations

    The dialogue summarization task involves summarizing long conversations while preserving the most salient information. Real-life dialogues often involve naturally occurring variations (e.g., repetitions, hesitations), and existing dialogue summarization models suffer a performance drop on such conversations. In this study, we systematically investigate the impact of such variations on state-of-the-art dialogue summarization models using publicly available datasets. To simulate real-life variations, we introduce two types of perturbations: utterance-level perturbations that modify individual utterances with errors and language variations, and dialogue-level perturbations that add non-informative exchanges (e.g., repetitions, greetings). We conduct our analysis along three dimensions of robustness: consistency, saliency, and faithfulness, which capture different aspects of the summarization model's performance. We find that both fine-tuned and instruction-tuned models are affected by input variations, with the latter being more susceptible, particularly to dialogue-level perturbations. We also validate our findings via human evaluation. Finally, we investigate whether the robustness of fine-tuned models can be improved by training them with a fraction of perturbed data and observe that this approach is insufficient to address robustness challenges with current models and thus warrants a more thorough investigation to identify better solutions. Overall, our work highlights robustness challenges in dialogue summarization and provides insights for future research.

    Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

    Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) being a no-code system, making NLP accessible to non-experts, (b) guiding users through the entire labeling process until they obtain a custom classifier, making the process efficient -- from cold start to classifier in a few hours, and (c) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will broaden the utilization of NLP models. Comment: 7 pages, 2 figures

    SCENE: Structural Conversation Evolution Network

    It's not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, 'what do people talk about?' focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying 'how' people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static document, and focuses on the structural elements of the conversation instead of being tied to the specific content.

    Graph-based Classification on Heterogeneous Information Networks

    A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that strongly-typed heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for part of the objects. Learning from such labeled and unlabeled data via classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied over decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only part of the objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graph-based regularization framework, GNetClass, is proposed to model the link structure in information networks with arbitrary network schema and number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing state-of-the-art methods. (Unpublished; peer reviewed)

    Discovering latent topical phrases in document collections and networks with text components: Leveraging text mining and information network analysis for human oriented applications

    One of the major challenges of mining topics from a large corpus is the quality of the constructed topics. While phrase-generating approaches generally produce high quality output, they do not scale very well with the size of the data. Thus, the state of the art solutions usually rely upon scalable unigram-generating methods, which do not produce high quality human-readable topics, or are forced to use external knowledge bases. Furthermore, while document collections naturally contain topics at different levels of granularity (general vs. specific), very few traditional methods focus on generating high quality hierarchical topic structures. This dissertation presents a series of approaches that directly address these challenges of generating high quality phrase-based topics, both as a flat set and organized as a hierarchy, as well as some potential applications. First, we describe a framework that generates high-quality topics represented by integrated lists of mixed-length phrases. The key is adopting a phrase-centric view towards the construction and ranking of topical phrases. The approach is domain-independent, and requires neither expert supervision nor an external knowledge base. The framework is initially constructed to work on collections of short texts, such as titles of scientific documents. However, we then show how the framework can be easily and robustly extended to work on collections of longer texts, and demonstrate its applicability to human needs with a task-centric evaluation. The dissertation then addresses the need to move beyond generating a flat set of topics, and presents an approach to constructing hierarchical topics, which extends the phrase-centric approach to create high quality phrases at varying levels of granularity. Another application of this technique is then presented: the task of entity role discovery. By tying entities in a community to topical phrases, users are able to explicitly understand both how and why individual entities are ranked within a specific community. A final extension is then described, which is a combined approach for constructing the hierarchy, which uses entity link information to improve the hierarchy quality.

    SCENE: Structural Conversation Evolution NEtwork

    It’s not just what you say, but it is how you say it. To date, the majority of the Instant Message (IM) analysis and research has focused on the content of the conversation. The main research question has been, ‘what do people talk about?’ focusing on topic extraction and topic modeling. While content is clearly critical for many real-world applications, we have largely ignored identifying ‘how’ people communicate. Conversation structure and communication patterns provide deep insight into how conversations evolve, and how the content is shared. Motivated by theoretical work from psychology and linguistics in the area of conversation alignment, we introduce SCENE, an evolution network approach to extract knowledge from a conversation network. We demonstrate the potential of our approach by taking the task of matching conversation partners. We find that SCENE is more successful because, in contrast to existing approaches, SCENE treats a conversation as an evolving, rather than a static document, and focuses on the structural elements of the conversation instead of being tied to the specific content.